Systems and methods for providing interactive speaker identification training

ABSTRACT

A system (100) provides speaker identification training. The system (100) generates speaker models and receives audio segments. The system (100) identifies speakers corresponding to the audio segments based on the speaker models. At least one of the audio segments has an unidentified or misidentified speaker (i.e., an audio segment whose speaker cannot be accurately identified). The system (100) presents, to a user, audio segments that include an audio segment whose speaker is unidentified or misidentified and receives, from the user, the name of the unidentified or misidentified speaker. The system (100) may use this information to subsequently identify the unidentified or misidentified speaker by name for future audio segments.

RELATED APPLICATION

[0001] This application claims priority under 35 U.S.C. § 119 based on U.S. Provisional Application No. 60/419,214, filed Oct. 17, 2002, the disclosure of which is incorporated herein by reference.

[0002] This application is related to U.S. patent application, Ser. No. 10/______ (Docket No. 02-4042), entitled “Continuous Learning for Speech Recognition Systems,” filed concurrently herewith, and U.S. patent application, Ser. No. 10/610,533 (Docket No. 02-4046), entitled “Systems and Methods for Improving Recognition Results via User-Augmentation of a Database,” filed Jul. 2, 2003, the disclosures of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

[0003] 1. Field of the Invention

[0004] The present invention relates generally to multimedia environments and, more particularly, to systems and methods for providing interactive speech identification training in a multimedia environment.

[0005] 2. Description of Related Art

[0006] Conventional speaker identification systems require a huge amount of training data to identify speakers based on audio data received from the speakers. These systems typically train a set of models based on the training data. When audio data is received by one of these systems, the system segments the audio data into speaker turns. The system then analyzes each of the speaker turns to determine whether it matches one of the models. If a speaker turn matches one of the models, the system labels the speaker turn with the name of the speaker.

[0007] For proper speaker identification, the system requires a huge amount of training data. For each speaker, for example, the system requires several minutes of speech to generate a Gaussian mixture model that can later be used to locate segments of speech from the same speaker. In some situations, it is difficult to obtain a sufficient amount of training data to generate an accurate Gaussian mixture model. For example, it is difficult to obtain sufficient training data from speakers who do not routinely speak or who speak in only small bursts (i.e., one to two sentences at a time).

[0008] As a result, there is a need for mechanisms to improve speaker identification results even for speakers where there is insufficient training data.

SUMMARY OF THE INVENTION

[0009] Systems and methods consistent with the present invention collect data for training a speaker identification system in an interactive manner. Users may be prompted to identify speaker turns for unidentified or incorrectly identified speakers. The data from the users may be used to generate new speaker models and/or update links to existing speaker models. The speaker models may then be used to correctly label speaker turns.

[0010] In one aspect consistent with the principles of the invention, a system provides speaker identification training. The system generates speaker models and receives audio segments. The system identifies speakers corresponding to the audio segments based on the speaker models. At least one of the audio segments has an unidentified or misidentified speaker (i.e., an audio segment whose speaker cannot be accurately identified). The system presents, to a user, audio segments that include an audio segment whose speaker is unidentified or misidentified and receives, from the user, the name of the unidentified or misidentified speaker. The system may use this information to subsequently identify the unidentified or misidentified speaker by name for future audio segments.

[0011] In another aspect consistent with the principles of the invention, a speaker identification system is provided. The system includes an indexer and a server. The indexer is configured to generate speaker models, receive audio segments, and identify speakers corresponding to the audio segments based on the speaker models. The indexer is unable to correctly identify at least one of the speakers corresponding to the audio segments. The server is configured to receive, from a user, the name of an unidentified speaker of the speakers corresponding to the audio segments, and provide the name of the unidentified speaker to the indexer for identification of the unidentified speaker in subsequent audio segments.

[0012] In a further aspect consistent with the principles of the invention, a computer-readable medium stores instructions executable by one or more processors for speaker identification training by a speaker identification system. The computer-readable medium includes instructions for generating speaker models based on training data; instructions for presenting, to a user, audio segments for which no speakers can be identified from the speaker models; instructions for obtaining, from the user, a name of a speaker for at least one of the audio segments; instructions for generating a new speaker model for the speaker based on the at least one audio segment; and instructions for associating the name of the speaker with the new speaker model.

[0013] In another aspect consistent with the principles of the invention, a speaker identification system includes an indexer, a database, and a server. The indexer receives speech segments, where each of the speech segments has a corresponding speaker. The indexer also creates documents by transcribing the speech segments and identifies the names of the speakers corresponding to the speech segments. The indexer is unable to correctly identify the names of at least one of the speakers corresponding to the speech segments. As a result, some of the speakers are identified speakers and others of the speakers are unidentified speakers. The database stores the documents. The server retrieves one or more of the documents from the database and presents the one or more of the documents to a user. The server receives, from the user, the name of one of the unidentified speakers and provides the name of the unidentified speaker to the indexer for subsequent identification of speech segments from the unidentified speaker.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate the invention and, together with the description, explain the invention. In the drawings,

[0015] FIG. 1 is a diagram of a system in which systems and methods consistent with the present invention may be implemented;

[0016] FIG. 2 is an exemplary diagram of the indexer of FIG. 1 according to an implementation consistent with the principles of the invention;

[0017] FIG. 3 is an exemplary diagram of a portion of the recognition system of FIG. 2 according to an implementation consistent with the present invention;

[0018] FIG. 4 is an exemplary diagram of the memory system of FIG. 1 according to an implementation consistent with the principles of the invention;

[0019] FIGS. 5A-5C are flowcharts of exemplary processing for speaker identification training according to an implementation consistent with the principles of the invention; and

[0020] FIG. 6 is a diagram of an exemplary graphical user interface by which a document may be presented to a user according to an implementation consistent with the principles of the invention.

DETAILED DESCRIPTION

[0021] The following detailed description of the invention refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements. Also, the following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims and equivalents.

[0022] Systems and methods consistent with the present invention permit users to aid in the training of a speaker identification system by, for example, correctly identifying unidentified and misidentified speakers. The systems and methods may use the user-identification of unidentified speakers to train new speaker models for improved recognition results. The systems and methods may use the user-identification of misidentified speakers to update links to existing speaker models.

EXEMPLARY SYSTEM

[0023] FIG. 1 is a diagram of an exemplary system 100 in which systems and methods consistent with the present invention may be implemented. System 100 may include multimedia (MM) sources 110, indexer 120, memory system 130, and server 140 connected to clients 150 via network 160. Network 160 may include any type of network, such as a local area network (LAN), a wide area network (WAN) (e.g., the Internet), a public telephone network (e.g., the Public Switched Telephone Network (PSTN)), a virtual private network (VPN), or a combination of networks. The various connections shown in FIG. 1 may be made via wired, wireless, and/or optical connections.

[0024] Multimedia sources 110 may include one or more audio sources and/or one or more video sources. An audio source may include any source of audio data, such as radio, telephone, and conversations in any language. A video source may include any source of video data with integrated audio data in any language, such as television, satellite, and a camcorder. The audio and/or video data may be provided to indexer 120 as a stream or file.

[0025] Indexer 120 may include mechanisms for processing audio and/or video data. Indexer 120 may include mechanisms that receive data from multimedia sources 110, process the data, perform feature extraction, and output analyzed, marked-up, and enhanced language metadata. In one implementation consistent with the principles of the invention, indexer 120 includes mechanisms, such as the ones described in John Makhoul et al., “Speech and Language Technologies for Audio Indexing and Retrieval,” Proceedings of the IEEE, Vol. 88, No. 8, August 2000, pp. 1338-1353, which is incorporated herein by reference.

[0026] FIG. 2 is an exemplary diagram of indexer 120 according to an implementation consistent with the principles of the invention. Indexer 120 may include training system 210, statistical model 220, and recognition system 230. Training system 210 may include logic that estimates parameters of statistical model 220 from a corpus of training data. The training data may initially include human-produced data. For example, the training data might include one hundred hours of audio data that has been meticulously and accurately transcribed by a human. Training system 210 may use the training data to generate parameters for statistical model 220 that recognition system 230 may later use to recognize future data that it receives (i.e., new audio that it has not heard before).

[0027] The training data might also include audio data for which the speaker has been identified. Training system 210 may use the training data to generate parameters for statistical speaker models that recognition system 230 may later use to recognize speakers from future data that it receives. To build a speaker model for a new speaker, training system 210 may use a speaker independent Gaussian mixture model with approximately 2,048 Gaussians. This Gaussian mixture model may be trained based on a lot of diversified speakers (e.g., speakers of different ages, different genders, etc.). Training system 210 may use a conventional expectation and maximization process to fit audio from a speaker to the model. Training system 210 may then use a conventional maximum a posteriori adaptation process to generate a final model for the speaker.
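By way of illustration only, the adaptation of a speaker independent Gaussian mixture model to a new speaker might be sketched as follows. This is a minimal sketch under stated assumptions and not the implementation of training system 210: the features are one-dimensional, the mixture is tiny rather than approximately 2,048 Gaussians, a single expectation pass collects the statistics, only the component means are adapted, and the relevance factor of 16 is an assumed tuning value. The names `gaussian_pdf`, `map_adapt_means`, and `relevance` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def map_adapt_means(weights, means, variances, frames, relevance=16.0):
    """Mean-only MAP adaptation of a 1-D speaker independent GMM (assumed form)."""
    k = len(weights)
    occupancy = [0.0] * k     # soft count of frames assigned to each component
    weighted_sum = [0.0] * k  # responsibility-weighted sum of the frames

    for x in frames:
        # Expectation step: posterior probability of each component given x.
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        if total == 0.0:
            continue  # frame is too far from every component to contribute
        for i in range(k):
            resp = joint[i] / total
            occupancy[i] += resp
            weighted_sum[i] += resp * x

    adapted = []
    for i in range(k):
        if occupancy[i] == 0.0:
            adapted.append(means[i])  # no data: keep the prior mean
            continue
        # Components that saw more speaker data move further from the prior.
        alpha = occupancy[i] / (occupancy[i] + relevance)
        data_mean = weighted_sum[i] / occupancy[i]
        adapted.append(alpha * data_mean + (1.0 - alpha) * means[i])
    return adapted
```

In this sketch, a component that accounts for little of the speaker's audio stays close to the speaker independent mean, which is one reason adaptation of this kind can tolerate limited training data.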

[0028] Statistical model 220 may include acoustic models, language models, and speaker models. The acoustic models may describe the time-varying evolution of feature vectors for each sound or phoneme. The acoustic models may employ continuous hidden Markov models (HMMs) to model each of the phonemes in the various phonetic contexts.

[0029] The language models may include n-gram language models, where the probability of each word is a function of the previous word (for a bi-gram language model) or the previous two words (for a tri-gram language model). Typically, the higher the order of the language model, the higher the recognition accuracy, at the cost of slower recognition speeds.
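By way of illustration only, a bi-gram language model of the kind mentioned above might be estimated from counts as follows. This is a minimal sketch assuming whitespace tokenization, a hypothetical start-of-sentence marker `<s>`, and unsmoothed maximum likelihood estimates; the function names are hypothetical, and a practical recognizer would normally add smoothing.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and word pairs over whitespace-tokenized sentences."""
    context_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, word)] += 1
    return context_counts, pair_counts


def bigram_prob(context_counts, pair_counts, prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return pair_counts[(prev, word)] / context_counts[prev]
```

A tri-gram model would follow the same pattern with the two preceding words as the context.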

[0030] The speaker models may include a pool of models that are used to identify speakers from their speech. Speech from a particular speaker may be compared to the speaker models to determine the likelihood that the speech was produced from each of the models. The model with the highest likelihood may be determined to correspond to that speech. The particular speaker corresponding to the model may be determined based, for example, on a mapping function that links speaker names to speaker models.

[0031] Recognition system 230 may use statistical model 220 to process input audio data. FIG. 3 is an exemplary diagram of a portion of recognition system 230 according to an implementation consistent with the principles of the invention. Recognition system 230 may include speaker segmentation logic 310, speech recognition logic 320, speaker clustering logic 330, and speaker identification logic 340. In one implementation consistent with the principles of the invention, the functions performed by speaker segmentation logic 310, speech recognition logic 320, speaker clustering logic 330, and speaker identification logic 340 are similar to the functions described in John Makhoul et al., “Speech and Language Technologies for Audio Indexing and Retrieval,” Proceedings of the IEEE, Vol. 88, No. 8, August 2000, pp. 1338-1353, which was previously incorporated herein by reference.

[0032] Generally, speaker segmentation logic 310 may distinguish speech from silence, noise, and other audio signals in input audio data. For example, speaker segmentation logic 310 may analyze each thirty second window of the input data to determine whether it contains speech. Speaker segmentation logic 310 may also identify boundaries between speakers in the input stream. Speaker segmentation logic 310 may group speech segments from the same speaker and send the segments to speech recognition logic 320.

[0033] Speech recognition logic 320 may perform continuous speech recognition to recognize the words spoken in the segments that it receives from speaker segmentation logic 310. Speech recognition logic 320 may generate a transcription of the speech using statistical model 220. Speaker clustering logic 330 may identify all of the segments from the same speaker in a single document (i.e., a body of media that is contiguous in time (from beginning to end or from time A to time B)) and group them into speaker clusters (or speaker turns). Speaker clustering logic 330 may then assign each of the speaker turns a unique label.

[0034] Speaker identification logic 340 may identify the speaker in each speaker turn by name or, when the name cannot be determined, by gender. Speaker identification logic 340 may use the speaker models within statistical model 220 to identify speakers from their speech. For example, speaker identification logic 340 may compare speech from a particular speaker within a speaker turn to the speaker models to determine the likelihood that the speech was produced from each of the models. Speaker identification logic 340 may identify the model with the highest likelihood (above some threshold) as the model corresponding to that speech. Speaker identification logic 340 may include a mapping function that links speaker names to the speaker models so that speaker identification logic 340 can identify the speaker by name once a model has been identified.
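By way of illustration only, the selection of the most likely speaker model and the lookup of the speaker name might be sketched as follows. This is a minimal sketch assuming that a log-likelihood score has already been computed for each model in the pool; the function name, the dictionary layout, the threshold value, and the fallback label are all hypothetical.

```python
def identify_speaker(scores, model_to_name, threshold, fallback_label):
    """Return the speaker name for the best-scoring model, or a fallback label.

    scores: dict mapping a model identifier to a log-likelihood (assumed input).
    model_to_name: dict mapping a model identifier to a speaker name.
    fallback_label: label used when no name can be determined, e.g. a gender label.
    """
    if not scores:
        return fallback_label
    best_model = max(scores, key=scores.get)
    if scores[best_model] < threshold:
        return fallback_label  # best match is not convincing enough
    # A model with no name mapped to it also falls back to the gender label.
    return model_to_name.get(best_model, fallback_label)
```

For example, with scores of -120.0 for model "m1" and -95.0 for model "m2", a threshold of -100.0, and "m2" mapped to "Elizabeth Vargas", the sketch returns that name; raising the threshold to -90.0 returns the fallback label instead.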

[0035] Returning to FIG. 1, memory system 130 may store documents from indexer 120 and, possibly, documents from clients 150. FIG. 4 is an exemplary diagram of memory system 130 according to an implementation consistent with the principles of the invention. Memory system 130 may include loader 410, trainer 420, one or more databases 430, and interface 440. Loader 410 may include logic that receives documents from indexer 120 and stores them in database 430. Trainer 420 may include logic that sends speaker identification information, such as audio data with identification of the corresponding speaker, in the form of training data to indexer 120.

[0036] Database 430 may include a conventional database, such as a relational database, that stores documents from indexer 120. Database 430 may also store documents received from clients 150 via server 140. Interface 440 may include logic that interacts with server 140 to store documents in database 430, query or search database 430, and retrieve documents from database 430.

[0037] Returning to FIG. 1, server 140 may include a computer (e.g., a processor and memory) or another device that is capable of interacting with memory system 130 and clients 150 via network 160. Server 140 may receive queries from clients 150 and use the queries to retrieve relevant documents from memory system 130.

[0038] Clients 150 may include personal computers, laptops, personal digital assistants, or other types of devices that are capable of interacting with server 140 to retrieve documents from memory system 130. Clients 150 may present information to users via a graphical user interface, such as a web browser window.

EXEMPLARY PROCESSING

[0039] Systems and methods consistent with the present invention permit users to assist in speaker identification training to improve recognition results of system 100. For example, the user may supply the name of an unidentified or misidentified speaker that may be used to retrain indexer 120.

[0040] FIGS. 5A-5C are flowcharts of exemplary processing for speaker identification training according to an implementation consistent with the principles of the invention. Processing may begin with a user desiring to retrieve one or more documents from memory system 130. The user may use a conventional web browser of client 150 to access server 140 in a conventional manner. To obtain documents of interest, the user may generate a search query and send the query to server 140 via client 150. Server 140 may use the query to search memory system 130 and retrieve relevant documents.

[0041] Server 140 may present the relevant documents to the user. For example, the user may be presented with a list of relevant documents. The documents may include any combination of audio documents and video documents. The user may select one or more documents on the list to view. When this happens, the user may be presented with a transcription corresponding to the audio data from the document. The user may also be presented with the audio and/or video data corresponding to the transcription.

[0042] FIG. 6 is a diagram of an exemplary graphical user interface (GUI) 600 by which a document may be presented to a user according to an implementation consistent with the principles of the invention. In one implementation, GUI 600 is part of an interface of a standard Internet browser, such as Internet Explorer or Netscape Navigator, or any browser that follows World Wide Web Consortium (W3C) specifications for HTML.

[0043] GUI 600 may include a speaker section 610 and a transcription section 620. Speaker section 610 may identify boundaries between speakers, the gender of a speaker, and the name of a speaker (when known). In this way, speaker segments are clustered together over the entire document to group together segments from the same speaker under the same label. In the example of FIG. 6, one speaker, Elizabeth Vargas, has been identified by name. Other speakers have been identified by gender and number. For example, if “male 1” speaks again within the same document, his speech will also be labeled “male 1.” Transcription section 620 may include a transcription of the document. In the example of FIG. 6, the document corresponds to video data from a television broadcast of ABC's World News Tonight.
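By way of illustration only, the labeling shown in speaker section 610 might be produced as sketched below. This is a minimal sketch assuming that each speaker turn arrives with a hypothetical cluster identifier, a gender, and a name that is `None` when the speaker could not be identified; the numbering of unnamed speakers in order of first appearance is also an assumption.

```python
def label_turns(turns):
    """Assign one label per speaker cluster and return a label for every turn.

    turns: list of (cluster_id, gender, name_or_None) tuples in document order.
    """
    cluster_labels = {}   # cluster identifier -> label shown to the user
    gender_counters = {}  # gender -> number of unnamed speakers seen so far
    labels = []
    for cluster_id, gender, name in turns:
        if cluster_id not in cluster_labels:
            if name is not None:
                cluster_labels[cluster_id] = name
            else:
                gender_counters[gender] = gender_counters.get(gender, 0) + 1
                cluster_labels[cluster_id] = f"{gender} {gender_counters[gender]}"
        # Later turns from the same cluster reuse the label chosen earlier.
        labels.append(cluster_labels[cluster_id])
    return labels
```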

[0044] GUI 600 may also include playback button 630. The user may select playback button 630 to hear the audio (or possibly see the video) corresponding to the transcription in transcription section 620. To facilitate this, the user may select a portion of the transcription in transcription section 620 by, for example, using a mouse to highlight the portion. The user may then select playback button 630 to hear the audio (or see the video) corresponding to the selected portion of the transcription.

[0045] GUI 600 may further include a correction button 640. The user may select correction button 640 when the user desires to supply the name of an unidentified speaker or correct the name of a misidentified speaker within speaker section 610. Sometimes, a speaker may be identified only by gender in speaker section 610 because indexer 120 (FIG. 1) can identify the speaker's gender but cannot determine the speaker's name. This may occur when indexer 120 has not yet generated a speaker model corresponding to the speaker due, for example, to a lack of audio data for the speaker. Sometimes, a speaker may be misidentified (i.e., labeled with the incorrect speaker name) in speaker section 610. This may occur when indexer 120 incorrectly matches the audio data from the speaker to a speaker model of another speaker. If the user desires, the user may provide the name of an unidentified speaker or correct the name of a misidentified speaker by selecting correction button 640 and providing the correct information.

[0046] GUI 600 may receive the information provided by the user and modify the document onscreen. This way, the user may determine whether the information was correctly provided. GUI 600 may also send the speaker identification information to server 140.

[0047] Returning to FIG. 5A, the user may express a desire to provide the name of an unidentified speaker or correct the name of a misidentified speaker. In doing so, the user may identify a segment, such as a speaker turn, within transcription section 620 (FIG. 6) (act 505). The user may identify the speaker turn by selecting the corresponding segment using, for example, a mouse or by selecting one of the labels within speaker section 610. The user may then select correction button 640 to initiate the correction.

[0048] Client 150 may determine whether the user has selected to identify the name of an unidentified speaker or correct the name of a misidentified speaker (act 510). Client 150 may make this determination based on the current speaker label provided in speaker section 610 for the segment identified by the user. For example, if the current speaker label includes a gender identification, then client 150 may determine that the user has selected to identify the name of an unidentified speaker. If, on the other hand, the current speaker label includes something other than a gender identification, then client 150 may determine that the user has selected to correct the name of a misidentified speaker.
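By way of illustration only, the determination of act 510 might be sketched as follows. This is a minimal sketch assuming that a gender identification appears as the first word of the label (for example, "male 1"); the function name, the returned strings, and the set of gender words are hypothetical.

```python
GENDER_WORDS = ("male", "female")  # assumed vocabulary of gender labels


def correction_kind(current_label):
    """Decide which branch of the flow applies, based on the current label."""
    words = current_label.strip().lower().split()
    if words and words[0] in GENDER_WORDS:
        # A gender-only label means the speaker was never named.
        return "identify_unidentified"
    # Anything else is treated as a name that the user wants to correct.
    return "correct_misidentified"
```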

[0049] If the user desires to provide the name of an unidentified speaker, the user may input the name in any conventional manner (act 515) (FIG. 5B). Client 150 may provide the name and, possibly, other identifying information, as speaker identification information, to server 140. The other identifying information may include, for example, an indication of the audio segment to which the name corresponds, the audio (or video) data corresponding to the segment, and/or other information that may be useful for training indexer 120 to later identify audio data from this same speaker.

[0050] Server 140 may determine whether additional audio data is necessary or desired (act 520). For example, server 140 may determine whether the amount of audio data associated with the audio segment labeled by the user is less than approximately four minutes of audio. As explained above, the minimum amount of audio data needed to build a good Gaussian mixture model (with over 2,000 Gaussians) is approximately four minutes of audio.

[0051] If the audio data associated with the audio segment is less than four minutes in length, then server 140 may locate additional audio segments that may be from the same speaker (act 525). Server 140 may identify these additional audio segments based on the speaker clustering performed by indexer 120. Using the speaker clustering, more, similar audio segments may be identified in the same document and in other documents within memory system 130.

[0052] For each audio segment located, server 140 may confirm with the user that the audio segment is from the same speaker identified by the user (act 530). Server 140 may continue to obtain additional audio segments and confirm them with the user until it obtains around four minutes or more of audio data or the user expresses a desire to cease the confirmations.
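By way of illustration only, the confirmation loop of acts 520 through 530 might be sketched as follows. This is a minimal sketch assuming that each segment is a hypothetical `(segment_id, duration_seconds)` pair, that candidate segments have already been located, and that a hypothetical `confirm` callback returns "yes", "no", or "stop" for each candidate; the 240-second target stands in for the approximately four minutes mentioned above.

```python
def collect_training_audio(initial, candidates, confirm, target_seconds=240.0):
    """Gather user-confirmed segments until enough audio is collected.

    initial: the (segment_id, duration_seconds) pair labeled by the user.
    candidates: additional segments that may be from the same speaker.
    confirm: callable taking a segment and returning "yes", "no", or "stop".
    """
    accepted = [initial]
    total = initial[1]
    for segment in candidates:
        if total >= target_seconds:
            break  # enough audio has been collected
        answer = confirm(segment)
        if answer == "stop":
            break  # the user wishes to cease the confirmations
        if answer == "yes":
            accepted.append(segment)
            total += segment[1]
    return accepted, total
```

The accepted segments, together with the name supplied by the user, would then form the training data for the new speaker.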

[0053] Once server 140 collects a sufficient amount of audio data or the user stops the confirmations, server 140 may provide the audio data, along with the name of the speaker, to indexer 120 as training data. Indexer 120 may use the audio data to build a speaker model for the new speaker (act 535). As described above, indexer 120 may use a speaker independent Gaussian mixture model with approximately 2,048 Gaussians. This Gaussian mixture model may be trained based on a lot of diversified speakers (e.g., speakers of different ages, different genders, etc.). Indexer 120 may use a conventional expectation and maximization process to fit the audio data from the speaker to the model. Indexer 120 may then use a conventional maximum a posteriori adaptation process to generate a final model for the speaker.

[0054] Indexer 120 may add the speaker model to the pool of speaker models within statistical model 220 (FIG. 2) (act 540). Thereafter, indexer 120 may correctly identify new audio data that it receives corresponding to the new speaker.

[0055] If, instead of providing the name of an unidentified speaker, the user desires to correct the name of a misidentified speaker, the user may optionally input the correct name in any conventional manner (act 545) (FIG. 5C). Client 150 may provide the correct name and, possibly, other identifying information, as speaker identification information, to server 140. The other identifying information may include, for example, an indication of the segment (e.g., speaker turn) to which the name corresponds, the audio (or video) data corresponding to the segment, and/or other information that may be useful for indexer 120 to identify the speaker model corresponding to the speaker.

[0056] Server 140 may determine whether confirmation of additional audio data is necessary or desired (act 550). For example, a certain amount of audio data may be needed to accurately identify the incorrectly labeled speaker model. In this case, server 140 may identify other audio segments that are close to the current segment (i.e., the speaker turn that the user identified as mislabeled) (act 555). Server 140 may use information relating to the speaker clustering performed by indexer 120 to identify these other potentially mislabeled audio segments. For each segment located, server 140 may confirm with the user that the speaker turn is also labeled incorrectly (act 560). Server 140 may continue to obtain additional audio segments and confirm them with the user until some threshold is met or the user expresses a desire to cease the confirmations.

[0057] Server 140 may then provide the audio data, along with the correct name of the speaker, to indexer 120. Indexer 120 may use the audio data to identify the speaker model from which the audio data was produced. Indexer 120 may then update the name associated with the identified speaker model (act 565). For example, a mapping may exist between the names of speakers and the corresponding speaker models in statistical model 220. Indexer 120 may update the mapping to identify the correct name of the speaker for that model.

[0058] In an alternate implementation, the user knows only that the speaker is misidentified, but does not know the true identity of the speaker. For example, the user may indicate simply that the name of the speaker associated with the speaker turn is incorrect. In this case, indexer 120 may simply remove the mapping of the incorrect speaker name to the corresponding speaker model.
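By way of illustration only, the two ways of handling a misidentified speaker described above might be sketched as follows. This is a minimal sketch assuming that the mapping is held as a dictionary from a hypothetical model identifier to a speaker name, and that the model producing the audio has already been identified; the function name is hypothetical.

```python
def correct_speaker_name(model_to_name, model_id, correct_name=None):
    """Update or remove the name mapped to a speaker model.

    correct_name: the name supplied by the user, or None when the user knows
    only that the current name is wrong.
    """
    if correct_name is None:
        # True identity unknown: drop the incorrect mapping.
        model_to_name.pop(model_id, None)
    else:
        # Link the existing model to the corrected name.
        model_to_name[model_id] = correct_name
    return model_to_name
```

In this sketch the speaker model itself is left untouched; only the link between the model and the name changes.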

[0059] When indexer 120 thereafter comes across audio data corresponding to a speaker that has been identified by the user (as a result of either an unidentified or a misidentified speaker turn), indexer 120 compares the audio data to the models in the speaker model pool. Indexer 120 may determine the likelihood that the audio data was created by each of the speaker models. Indexer 120 may identify the speaker model with the highest likelihood (above some threshold) as the speaker model corresponding to the audio data. Indexer 120 may then use its mapping of speaker models to speaker names to identify the speaker of that audio data by name. Indexer 120 may also update the documents in memory system 130 with the speaker name(s) identified by the user.

CONCLUSION

[0060] Systems and methods consistent with the present invention provide an interactive speaker identification training system. Via these systems and methods, users are permitted to assist in the speaker identification training by, for example, correctly identifying unidentified and misidentified speakers. The systems and methods may use the user-identification of unidentified speakers to train new speaker models for improved recognition results. The systems and methods may use the user-identification of misidentified speakers to update links to existing speaker models.

[0061] The foregoing description of preferred embodiments of the present invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention.

[0062] For example, an exemplary graphical user interface has been described with regard to FIG. 6 as containing certain features consistent with the principles of the invention. It is to be understood that a graphical user interface, consistent with the present invention, may include any or all of these features or different features to facilitate user assistance in speaker identification training.

[0063] Further, a graphical user interface has been described as performing certain functions. It is to be understood that some, if not all, of these functions may be performed by client 150 or server 140.

[0064] While series of acts have been described with regard to FIGS. 5A-5C, the order of the acts may differ in other implementations consistent with the principles of the invention.

[0065] Further, certain portions of the invention have been described as “logic” that performs one or more functions. This logic may include hardware, such as an application specific integrated circuit or a field programmable gate array, software, or a combination of hardware and software.

[0066] No element, act, or instruction used in the description of the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used. The scope of the invention is defined by the claims and their equivalents.

What is claimed is:
 1. A speaker identification system, comprising: anindexer configured to: generate a plurality of speaker models, receive aplurality of audio segments, and identify speakers corresponding to theaudio segments based on the speaker models, the indexer being unable tocorrectly identify at least one of the speakers, as an unidentifiedspeaker, corresponding to the audio segments; and a server configuredto: receive, from a user, the name of the unidentified speaker, andprovide the name of the unidentified speaker to the indexer foridentification of the unidentified speaker in subsequent audio segments.2. The system of claim 1, wherein the indexer is further configured to:generate a new speaker model for the unidentified speaker based on anaudio segment corresponding to the unidentified speaker.
 3. The system of claim 1, wherein the indexer is further configured to: generate labels for the audio segments, the labels being based on names of speakers that can be identified and gender of speakers that cannot be identified.
 4. The system of claim 3, wherein when receiving the name of an unidentified speaker, the server is configured to: present a document to the user, the document including a transcription of a plurality of the audio segments and the labels for the plurality of the audio segments, and receive, from the user, the name of one of the speakers that cannot be identified.
 5. The system of claim 4, wherein when presenting a document, the server is further configured to: provide audio data corresponding to at least one of the plurality of the audio segments to the user.
 6. The system of claim 1, wherein the server is further configured to: locate one or more additional audio segments from the unidentified speaker, and present the one or more additional audio segments to the user for confirmation that the one or more additional audio segments were produced by the unidentified speaker.
 7. The system of claim 6, wherein when presenting the one or more additional audio segments, the server is configured to continue to present audio segments to the user for confirmation until at least four minutes of audio data is obtained.
 8. The system of claim 6, wherein the unidentified speaker corresponds to at least one of the audio segments; and wherein when locating one or more additional audio segments, the server is configured to: find one or more additional audio segments similar to the at least one of the audio segments.
 9. The system of claim 1, wherein the indexer is configured to: fit audio data from the unidentified speaker to a speaker independent Gaussian mixture model using an expectation and maximization process, and generate a new speaker model for the unidentified speaker using a maximum a posteriori adaptation process.
 10. The system of claim 1, wherein the unidentified speaker is a misidentified speaker of one of the audio segments; and wherein when receiving the name, the server is configured to: receive, from the user, a correct name of a speaker of the one of the audio segments.
 11. The system of claim 10, wherein the indexer is further configured to: identify one of the speaker models, as an identified speaker model, that corresponds to the one of the audio segments, and update a label associated with the identified speaker model to include the correct name of the speaker of the one of the audio segments.
 12. The system of claim 10, wherein the server is further configured to: locate one or more additional audio segments similar to the one of the audio segments, and present the one or more additional audio segments to the user for confirmation that the one or more additional audio segments were produced by the speaker of the one of the audio segments.
 13. A speaker identification system, comprising: means for generating a plurality of speaker models; means for receiving a plurality of audio segments; means for identifying speakers corresponding to the audio segments based on the speaker models, at least one of the audio segments being associated with an unidentified or misidentified speaker; means for labeling the audio segments with names of the speakers that can be identified; means for presenting a plurality of the audio segments, including the at least one of the audio segments, with the labels to a user; means for receiving, from the user, the name of the unidentified or misidentified speaker; and means for identifying the unidentified or misidentified speaker by name in future audio segments.
 14. A method for providing speaker identification training, comprising: generating a plurality of speaker models; receiving a plurality of audio segments; identifying speakers corresponding to the audio segments based on the speaker models, at least one of the audio segments being associated with an unidentified or misidentified speaker; presenting a plurality of the audio segments, including the at least one of the audio segments, to a user; receiving, from the user, the name of the unidentified or misidentified speaker; and identifying the unidentified or misidentified speaker by name for future audio segments.
 15. The method of claim 14, wherein the unidentified or misidentified speaker is an unidentified speaker; and wherein the method further comprises: generating a new speaker model for the unidentified speaker based on the at least one of the audio segments.
 16. The method of claim 14, further comprising: generating labels for the audio segments, the labels being based on names of speakers that can be identified and gender of speakers that cannot be identified.
 17. The method of claim 16, wherein the presenting a plurality of the audio segments includes: providing a document to the user, the document including a transcription of the plurality of the audio segments and the labels for the plurality of the audio segments.
 18. The method of claim 17, wherein the providing a document includes: providing audio data corresponding to one or more of the plurality of the audio segments to the user.
 19. The method of claim 14, further comprising: locating one or more additional audio segments from the unidentified or misidentified speaker, and presenting the one or more additional audio segments to the user for confirmation that the one or more additional audio segments were produced by the unidentified or misidentified speaker.
 20. The method of claim 19, wherein the presenting the one or more additional audio segments includes: presenting audio segments to the user for confirmation until at least four minutes of audio data is obtained.
 21. The method of claim 19, wherein the locating one or more additional audio segments includes: finding one or more additional audio segments similar to the at least one of the audio segments.
 22. The method of claim 14, wherein the unidentified or misidentified speaker is an unidentified speaker; and wherein the method further comprises: fitting audio data from the unidentified speaker to a speaker independent Gaussian mixture model using an expectation and maximization process; and generating a new speaker model for the unidentified speaker using a maximum a posteriori adaptation process.
 23. The method of claim 14, wherein the unidentified or misidentified speaker is a misidentified speaker; and wherein the receiving the name includes: receiving, from the user, a correct name of a speaker of the at least one of the audio segments.
 24. The method of claim 23, further comprising: identifying one of the speaker models, as an identified speaker model, that corresponds to the at least one of the audio segments; and updating a label associated with the identified speaker model to include the correct name of the speaker of the at least one of the audio segments.
 25. A computer-readable medium that stores instructions executable by one or more processors for speaker identification training by a speaker identification system, comprising: instructions for generating a plurality of speaker models based on training data; instructions for presenting, to a user, audio segments for which no speakers can be identified from the speaker models; instructions for obtaining, from the user, a name of a speaker for at least one of the audio segments; instructions for generating a new speaker model for the speaker based on the at least one of the audio segments; and instructions for associating the name of the speaker with the new speaker model.
 26. A speaker identification system, comprising: an indexer configured to: receive a plurality of speech segments, each of the speech segments being associated with a corresponding speaker, create a plurality of documents by transcribing the speech segments, identify names of the speakers corresponding to the speech segments, the indexer being unable to correctly identify names of at least one of the speakers corresponding to the speech segments, the speakers for which the indexer can correctly identify names being identified speakers and the speakers for which the indexer cannot correctly identify names being unidentified speakers; a database configured to store the documents; and a server configured to: retrieve one or more of the documents from the database, present the one or more of the documents to a user, receive, from the user, a name for one of the unidentified speakers, and provide the name for the one of the unidentified speakers to the indexer for subsequent identification of speech segments from the one of the unidentified speakers.